Skip to content

HTML API: Decode semicolonless legacy references before non-ASCII attribute followers#65

Open
sirreal wants to merge 1 commit into
trunkfrom
fix/html-decoder-legacy-follower-ascii
Open

HTML API: Decode semicolonless legacy references before non-ASCII attribute followers#65
sirreal wants to merge 1 commit into
trunkfrom
fix/html-decoder-legacy-follower-ascii

Conversation

@sirreal

@sirreal sirreal commented Jun 12, 2026

Copy link
Copy Markdown
Owner

What

Fixes WP_HTML_Decoder::read_character_reference() and decode_attribute() for semicolonless legacy named character references followed by non-ASCII bytes.

Issue

HTML only treats a semicolonless named reference in an attribute as ambiguous when the following character is ASCII alphanumeric or =. A non-ASCII follower is not an ambiguous follower, so the legacy reference should still decode.

The current decoder uses ctype_alnum() on the follower byte. ctype_alnum() depends on LC_CTYPE; under C.UTF-8/UTF-8-style locales, bytes such as 0xC2, 0xC3, and 0xF0 can be classified as alphanumeric. That makes the decoder incorrectly leave references such as &Aacute literal before a multibyte UTF-8 follower. It also makes behavior depend on locale instead of the HTML ASCII-only rule.

Reproduction

This payload is &Aacute followed by the UTF-8 bytes C2 80. The locale setup is important because the bug depends on LC_CTYPE:

$previous_locale = setlocale( LC_CTYPE, 0 );
$affected_locale = setlocale( LC_CTYPE, "C.UTF-8", "en_US.UTF-8", "de_DE.UTF-8", "fr_FR.UTF-8" );

if ( false === $affected_locale || ! ctype_alnum( "\xC2" ) ) {
    die( "This platform does not have an affected LC_CTYPE locale.\n" );
}

try {
    $raw = "&Aacute\xC2\x80";
    $decoded = WP_HTML_Decoder::decode_attribute( $raw );

    var_dump( bin2hex( $decoded ) );

    $match_byte_length = null;
    $reference = WP_HTML_Decoder::read_character_reference( "attribute", $raw, 0, $match_byte_length );

    var_dump( bin2hex( $reference ) );
    var_dump( $match_byte_length );
} finally {
    setlocale( LC_CTYPE, $previous_locale );
}

Expected output, matching Dom\\HTMLDocument:

string(8) "c381c280" // Á, then the original C2 80 follower.
string(4) "c381"     // The decoded &Aacute reference.
int(7)                // strlen( "&Aacute" ).

Current trunk under an affected locale returns the literal input instead:

string(18) "26416163757465c280" // &Aacute\xC2\x80 left undecoded.
string(0) ""
NULL

The ASCII ambiguity rule still has to hold for followers such as 0, A, a, and =:

var_dump( WP_HTML_Decoder::decode_attribute( "&Aacute0" ) );
var_dump( WP_HTML_Decoder::decode_attribute( "&AacuteA" ) );
var_dump( WP_HTML_Decoder::decode_attribute( "&Aacutea" ) );
var_dump( WP_HTML_Decoder::decode_attribute( "&Aacute=" ) );

Expected: each value remains unchanged.

Fix

Read the follower as a byte and check explicit ASCII ranges for 0-9, A-Z, a-z, and =. This matches the HTML attribute ambiguity rule and avoids locale-dependent byte classification.

Validation

vendor/bin/phpunit --filter "test_semicolonless_legacy_reference_before_multibyte_attribute_follower|test_semicolonless_legacy_reference_before_ascii_attribute_follower_is_ambiguous" tests/phpunit/tests/html-api/wpHtmlDecoder.php

Result: OK, 5 tests, 15 assertions.

Trac ticket: TBD

Use of AI Tools

AI assistance: Yes
Tool(s): Codex
Model(s): GPT-5
Used for: splitting the fuzzer-discovered fix into a focused PR, drafting reproduction notes, and running validation. Final implementation was reviewed against the branch diff.


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@sirreal sirreal force-pushed the fix/html-decoder-legacy-follower-ascii branch from 7169693 to f92da80 Compare June 12, 2026 22:01
@sirreal

sirreal commented Jun 12, 2026

Copy link
Copy Markdown
Owner Author

I have reproduced this, the reproduction prints:

string(18) "26416163757465c280"

@sirreal sirreal marked this pull request as ready for review June 12, 2026 22:11
@github-actions

Copy link
Copy Markdown

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@dmsnell dmsnell left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs Trac ticket references and WPCS will whine if whitespace isn’t adjusted. Maybe @copilot can fix these.

Giving it a preemptive approval. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants